PCS: An Efficient Clustering Method for High-Dimensional Data

Authors

  • Wei Li
  • Cindy X. Chen
  • Jie Wang
Abstract

Clustering algorithms play an important role in data analysis and information retrieval. How to obtain a clustering for a large set of high-dimensional data suitable for database applications remains a challenge. We devise in this paper a set-theoretic clustering method called PCS (Pairwise Consensus Scheme) for high-dimensional data. Given a large set of d-dimensional data, PCS first constructs $\binom{d}{p}$ clusterings, where p ≤ d is a small number (e.g., p = 2 or p = 3) and each clustering is constructed on data projected to a combination of p selected dimensions using an existing p-dimensional clustering algorithm. PCS then constructs, using a greedy pairwise comparison technique based on a recent clustering algorithm [1], a near-optimal consensus clustering from these projected clusterings to be the final clustering of the original data set. We show that PCS incurs only a moderate I/O cost, and the memory requirement is independent of the data size. Finally, we carry out numerical experiments to demonstrate the efficiency of PCS.
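
The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`grid_cluster`, `projected_clusterings`, `common_refinement`), the `cell` parameter, and the sample points are all hypothetical. A simple grid-cell labeling stands in for the "existing p-dimensional clustering algorithm", and a naive common refinement of the projected clusterings stands in for the greedy pairwise consensus step, whose details the abstract does not give.

```python
from itertools import combinations
from math import comb


def grid_cluster(points, dims, cell=1.0):
    """Stand-in p-dimensional clustering (an assumption, not the algorithm
    used by PCS): label each point by its grid cell in the projected space."""
    return [tuple(int(pt[i] // cell) for i in dims) for pt in points]


def projected_clusterings(points, p, cell=1.0):
    """Build one clustering per combination of p out of d dimensions."""
    d = len(points[0])
    return {dims: grid_cluster(points, dims, cell)
            for dims in combinations(range(d), p)}


def common_refinement(clusterings):
    """Naive consensus (a substitute for the greedy pairwise scheme):
    two points share a cluster only if every projected clustering agrees."""
    groups = {}
    label_lists = list(clusterings.values())
    for idx, key in enumerate(zip(*label_lists)):
        groups.setdefault(key, []).append(idx)
    return sorted(groups.values())


# Hypothetical 4-dimensional data: two well-separated groups.
points = [
    (0.1, 0.2, 0.3, 0.4),
    (0.2, 0.1, 0.4, 0.3),
    (5.1, 5.2, 5.3, 5.4),
    (5.3, 5.1, 5.2, 5.5),
]

clusterings = projected_clusterings(points, p=2)
print(len(clusterings), comb(4, 2))    # number of projections vs. C(4, 2)
print(common_refinement(clusterings))  # groups of point indices
```

With d = 4 and p = 2 the sketch builds 6 projected clusterings, matching the binomial count in the abstract. Because the count grows quickly with d, keeping p small (2 or 3, as the abstract suggests) keeps the number of projections manageable.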

Similar resources

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

An Efficient Cell-Based Clustering Method for Handling Large, High-Dimensional Data

Data mining applications have recently required a large amount of high-dimensional data. However, most clustering methods for data mining applications do not work efficiently when dealing with large, high-dimensional data because of the so-called ‘curse of dimensionality’ [1] and the limitation of available memory. In this paper, we propose an efficient cell-based clustering method for handl...

Robust high-dimensional semiparametric regression using optimized differencing method applied to the vitamin B2 production data

Background and purpose: With advances in science, knowledge, and technology, we deal with high-dimensional data in which the number of predictors may considerably exceed the sample size. The main problems with high-dimensional data are the estimation of the coefficients and interpretation. For high-dimensional problems, classical methods are not reliable because of a large number of predictor variable...

High-Dimensional Clustering Method for High Performance Data Mining

Many clustering methods are not suitable for high-dimensional data because of the so-called ‘curse of dimensionality’ and the limitation of available memory. In this paper, we propose a new high-dimensional clustering method for high-performance data mining. The proposed high-dimensional clustering method provides efficient cell creation and cell insertion algorithms using a space-partitioni...

Feature Selection for Clustering

Clustering is an important data mining task. Data mining often concerns large and high-dimensional data, but unfortunately most of the clustering algorithms in the literature are sensitive to largeness or high dimensionality or both. Different features affect clusters differently; some are important for clusters while others may hinder the clustering task. An efficient way of handling it is by selecting ...

Publication date: 2008